Grid Search

python
datacamp
machine learning
deep learning
hyperparameter
gridsearch
Author

kakamana

Published

April 9, 2023

Grid Search with Scikit Learn

  • Steps in a Grid Search
    • Choosing an algorithm (the estimator) whose hyperparameters will be tuned
    • Defining which hyperparameters to tune
    • Defining a range of values for each hyperparameter
    • Setting a cross-validation scheme
    • Defining a score function so we can decide which square on our grid was ‘the best’
    • Including extra useful information or functions
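To make these steps concrete, here is a minimal sketch of a grid search written by hand, before handing the work over to GridSearchCV. The synthetic data from make_classification, the small grid values and n_estimators=20 are assumptions chosen only to keep the example quick; they are not the data or settings used in this post.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption, not the data set used in this post)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Which hyperparameters to tune, and the candidate values for each
param_grid = {"max_depth": [2, 4], "max_features": ["sqrt", "log2"]}

results = []
names = list(param_grid)
for values in product(*param_grid.values()):
    params = dict(zip(names, values))
    # The estimator, configured for this square of the grid
    model = RandomForestClassifier(n_estimators=20, random_state=0, **params)
    # The cross-validation scheme (3 folds) and the score function (ROC AUC)
    scores = cross_val_score(model, X, y, cv=3, scoring="roc_auc")
    results.append((scores.mean(), params))

# The 'best' square is the one with the highest mean cross-validated score
best_score, best_params = max(results, key=lambda r: r[0])
print(best_params, round(best_score, 3))
```

GridSearchCV wraps this loop and adds the extras from the last step, such as parallel processing, refitting the best model and recording training scores.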

GridSearchCV with Scikit Learn

The GridSearchCV class from Scikit Learn provides many useful features to assist with efficiently undertaking a grid search. You will now put your learning into practice by creating a GridSearchCV object with certain parameters.

The desired options are:

  • A Random Forest Estimator, with the split criterion as ‘entropy’
  • 5-fold cross validation
  • The hyperparameters max_depth (2, 4, 8, 15) and max_features (‘auto’ vs ‘sqrt’)
  • Use roc_auc to score the models
  • Use 4 cores for processing in parallel
  • Ensure you refit the best model and return training scores
Code
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Create a Random Forest Classifier with specified criterion
rf_class = RandomForestClassifier(criterion='entropy')

# Create the parameter grid
param_grid = {
    'max_depth':[2, 4, 8, 15],
    'max_features':['auto', 'sqrt']
}

# Create a GridSearchCV object
grid_rf_class = GridSearchCV(
    estimator=rf_class,
    param_grid=param_grid,
    scoring='roc_auc',
    n_jobs=4,
    cv=5,
    refit=True,
    return_train_score=True
)

print(grid_rf_class)
GridSearchCV(cv=5, estimator=RandomForestClassifier(criterion='entropy'),
             n_jobs=4,
             param_grid={'max_depth': [2, 4, 8, 15],
                         'max_features': ['auto', 'sqrt']},
             return_train_score=True, scoring='roc_auc')
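One caveat for newer Scikit Learn versions: max_features='auto' was deprecated in 1.1 and removed in 1.3, and for a RandomForestClassifier it behaved the same as 'sqrt', so the grid above effectively tries the same setting twice. A possible variant, sketched below, swaps 'auto' for 'log2'. That substitution is an assumption made for illustration, not part of the original exercise. The sketch also counts how many models the search will fit.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, ParameterGrid

rf_class = RandomForestClassifier(criterion="entropy")

# 'log2' replaces the deprecated 'auto' (a substitution made for this sketch)
param_grid = {
    "max_depth": [2, 4, 8, 15],
    "max_features": ["sqrt", "log2"],
}

grid_rf_class = GridSearchCV(
    estimator=rf_class,
    param_grid=param_grid,
    scoring="roc_auc",
    n_jobs=4,
    cv=5,
    refit=True,
    return_train_score=True,
)

# Number of squares on the grid, and cross-validation fits across the 5 folds
n_candidates = len(ParameterGrid(param_grid))
n_fits = n_candidates * grid_rf_class.cv
print(n_candidates, n_fits)
```

With refit=True, one further fit of the best setting on the full training data follows the cross-validation fits.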

Understanding a grid search output

Exploring the grid search results

You will now explore the cv_results_ attribute of the GridSearchCV object defined above. This is a dictionary that we can read into a pandas DataFrame, and it contains a lot of useful information about the grid search we just undertook.

A reminder of the different column types in this property:

  • time_ columns
  • param_ columns (one for each hyperparameter) and the singular params column (with all hyperparameter settings)
  • a train_score column for each cv fold including the mean_train_score and std_train_score columns
  • a test_score column for each cv fold including the mean_test_score and std_test_score columns
  • a rank_test_score column with a number from 1 to n (number of iterations) ranking the rows based on their mean_test_score
Code
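import pandas as pd

# Fit the grid search (X_train and y_train are assumed to be loaded already)
grid_rf_class.fit(X_train, y_train)

# Read the cv_results_ attribute into a DataFrame & print it out
cv_results_df = pd.DataFrame(grid_rf_class.cv_results_)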
print(cv_results_df)

# Extract and print the column with a dictionary of hyperparameters used
column = cv_results_df.loc[:, ["params"]]
print(column)

# Extract and print the row that had the best mean test score
best_row = cv_results_df[cv_results_df['rank_test_score'] == 1]
print(best_row)
/Users/kakamana/opt/anaconda3/lib/python3.9/site-packages/sklearn/ensemble/_forest.py:424: FutureWarning: `max_features='auto'` has been deprecated in 1.1 and will be removed in 1.3. To keep the past behaviour, explicitly set `max_features='sqrt'` or remove this parameter as it is also the default value for RandomForestClassifiers and ExtraTreesClassifiers.
  warn(
   mean_fit_time  std_fit_time  mean_score_time  std_score_time  \
0       0.553439      0.009132         0.013463        0.000448   
1       0.531026      0.009878         0.013239        0.000555   
2       0.914235      0.011106         0.017581        0.000717   
3       0.945417      0.020631         0.017516        0.000926   
4       1.644313      0.017119         0.026537        0.000673   
5       1.627506      0.009529         0.026094        0.000421   
6       2.613335      0.021025         0.041663        0.000498   
7       2.599870      0.028571         0.041664        0.000323   

  param_max_depth param_max_features  \
0               2               auto   
1               2               sqrt   
2               4               auto   
3               4               sqrt   
4               8               auto   
5               8               sqrt   
6              15               auto   
7              15               sqrt   

                                      params  split0_test_score  \
0   {'max_depth': 2, 'max_features': 'auto'}           0.766386   
1   {'max_depth': 2, 'max_features': 'sqrt'}           0.763033   
2   {'max_depth': 4, 'max_features': 'auto'}           0.770331   
3   {'max_depth': 4, 'max_features': 'sqrt'}           0.769073   
4   {'max_depth': 8, 'max_features': 'auto'}           0.774115   
5   {'max_depth': 8, 'max_features': 'sqrt'}           0.772641   
6  {'max_depth': 15, 'max_features': 'auto'}           0.766062   
7  {'max_depth': 15, 'max_features': 'sqrt'}           0.769417   

   split1_test_score  split2_test_score  ...  mean_test_score  std_test_score  \
0           0.762023           0.763215  ...         0.766701        0.003842   
1           0.762789           0.761229  ...         0.765009        0.003320   
2           0.766002           0.766667  ...         0.771341        0.004908   
3           0.766696           0.768804  ...         0.771745        0.004521   
4           0.770093           0.777018  ...         0.777718        0.005363   
5           0.769696           0.774317  ...         0.776565        0.005625   
6           0.765904           0.773864  ...         0.774454        0.007813   
7           0.767425           0.775033  ...         0.775002        0.005939   

   rank_test_score  split0_train_score  split1_train_score  \
0                7            0.768926            0.770673   
1                8            0.770699            0.768162   
2                6            0.779433            0.779525   
3                5            0.777827            0.780215   
4                1            0.829605            0.830069   
5                2            0.828698            0.827135   
6                4            0.974661            0.973459   
7                3            0.975158            0.971872   

   split2_train_score  split3_train_score  split4_train_score  \
0            0.770016            0.768063            0.769712   
1            0.768304            0.767662            0.766671   
2            0.777968            0.777477            0.777491   
3            0.779218            0.777710            0.777653   
4            0.828715            0.826782            0.825327   
5            0.827792            0.825685            0.827783   
6            0.972964            0.973317            0.975256   
7            0.974634            0.972400            0.974306   

   mean_train_score  std_train_score  
0          0.769478         0.000903  
1          0.768300         0.001329  
2          0.778379         0.000916  
3          0.778525         0.001025  
4          0.828100         0.001786  
5          0.827418         0.001000  
6          0.973932         0.000874  
7          0.973674         0.001296  

[8 rows x 22 columns]
                                      params
0   {'max_depth': 2, 'max_features': 'auto'}
1   {'max_depth': 2, 'max_features': 'sqrt'}
2   {'max_depth': 4, 'max_features': 'auto'}
3   {'max_depth': 4, 'max_features': 'sqrt'}
4   {'max_depth': 8, 'max_features': 'auto'}
5   {'max_depth': 8, 'max_features': 'sqrt'}
6  {'max_depth': 15, 'max_features': 'auto'}
7  {'max_depth': 15, 'max_features': 'sqrt'}
   mean_fit_time  std_fit_time  mean_score_time  std_score_time  \
4       1.644313      0.017119         0.026537        0.000673   

  param_max_depth param_max_features  \
4               8               auto   

                                     params  split0_test_score  \
4  {'max_depth': 8, 'max_features': 'auto'}           0.774115   

   split1_test_score  split2_test_score  ...  mean_test_score  std_test_score  \
4           0.770093           0.777018  ...         0.777718        0.005363   

   rank_test_score  split0_train_score  split1_train_score  \
4                1            0.829605            0.830069   

   split2_train_score  split3_train_score  split4_train_score  \
4            0.828715            0.826782            0.825327   

   mean_train_score  std_train_score  
4            0.8281         0.001786  

[1 rows x 22 columns]
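The full printout is wide, so it can help to keep only the mean scores and compare training and test performance side by side. The sketch below rebuilds a small DataFrame from the mean_train_score and mean_test_score values printed above; the gap column is a name introduced here for illustration. On a fitted object the same columns could be selected from cv_results_df directly.

```python
import pandas as pd

# Mean scores copied from the cv_results_ output printed above
summary = pd.DataFrame({
    "max_depth": [2, 2, 4, 4, 8, 8, 15, 15],
    "max_features": ["auto", "sqrt"] * 4,
    "mean_train_score": [0.769478, 0.768300, 0.778379, 0.778525,
                         0.828100, 0.827418, 0.973932, 0.973674],
    "mean_test_score": [0.766701, 0.765009, 0.771341, 0.771745,
                        0.777718, 0.776565, 0.774454, 0.775002],
})

# Difference between training and validation AUC for each square
summary["gap"] = summary["mean_train_score"] - summary["mean_test_score"]

# Best validation score first
print(summary.sort_values("mean_test_score", ascending=False))
```

The rows with max_depth=15 reach a training AUC of about 0.97 while their test AUC stays near 0.77, which suggests the deeper trees are overfitting. The best test score comes from max_depth=8.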

Analyzing the best results

At the end of the day, we primarily care about the best-performing ‘square’ in a grid search. Luckily, Scikit Learn’s GridSearchCV objects have a number of attributes that provide key information on just the best square (or row in cv_results_).

Three properties you will explore are:

  • best_score_ – The score (here ROC_AUC) from the best-performing square.
  • best_index_ – The index of the row in cv_results_ containing information on the best-performing square.
  • best_params_ – A dictionary of the parameters that gave the best score, for example ‘max_depth’: 10
Code
best_score = grid_rf_class.best_score_
print(best_score)

# Create a variable from the row related to the best-performing square
cv_results_df = pd.DataFrame(grid_rf_class.cv_results_)
best_row = cv_results_df.loc[[grid_rf_class.best_index_]]
print(best_row)

# Get the max_depth parameter from the best-performing square and print
best_max_depth = grid_rf_class.best_params_['max_depth']
print(best_max_depth)
0.777717676012218
   mean_fit_time  std_fit_time  mean_score_time  std_score_time  \
4       1.644313      0.017119         0.026537        0.000673   

  param_max_depth param_max_features  \
4               8               auto   

                                     params  split0_test_score  \
4  {'max_depth': 8, 'max_features': 'auto'}           0.774115   

   split1_test_score  split2_test_score  ...  mean_test_score  std_test_score  \
4           0.770093           0.777018  ...         0.777718        0.005363   

   rank_test_score  split0_train_score  split1_train_score  \
4                1            0.829605            0.830069   

   split2_train_score  split3_train_score  split4_train_score  \
4            0.828715            0.826782            0.825327   

   mean_train_score  std_train_score  
4            0.8281         0.001786  

[1 rows x 22 columns]
8
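The three attributes are views onto the same row of cv_results_, and a quick check makes that relationship explicit. The following is a minimal sketch on synthetic data; the data set, the small grid and n_estimators=20 are assumptions chosen to keep it fast, not the settings used in this post.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption, not the data set used in this post)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

grid = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid={"max_depth": [2, 4, 8]},
    scoring="roc_auc",
    cv=3,
)
grid.fit(X, y)

results = grid.cv_results_
best = grid.best_index_

# best_index_ points at the row ranked first on mean test score
same_rank = best == int(np.argmin(results["rank_test_score"]))
# best_score_ and best_params_ are read from that same row
same_score = bool(np.isclose(grid.best_score_, results["mean_test_score"][best]))
same_params = grid.best_params_ == results["params"][best]
print(same_rank, same_score, same_params)
```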

Using the best results

While it is interesting to analyze the results of our grid search, our final goal is practical in nature; we want to make predictions on our test set using our estimator object.

We can access this object through the best_estimator_ property of our grid search object.

In this exercise we will take a look inside the best_estimator_ property and then use this to make predictions on our test set for credit card defaults and generate a variety of scores. Remember to use predict_proba rather than predict since we need probability values rather than class labels for our roc_auc score. We use a slice [:,1] to get probabilities of the positive class.

Code
from sklearn.metrics import confusion_matrix, roc_auc_score

# See what type of object the best_estimator_ property is
print(type(grid_rf_class.best_estimator_))

# Create an array of predictions directly using the best_estimator_ property
predictions = grid_rf_class.best_estimator_.predict(X_test)

# Take a look to confirm it worked, this should be an array of 1's and 0's
print(predictions[0:5])

# Now create a confusion matrix
print("Confusion Matrix \n", confusion_matrix(y_test, predictions))

# Get the ROC-AUC score
predictions_proba = grid_rf_class.best_estimator_.predict_proba(X_test)[:, 1]
print("ROC-AUC Score \n", roc_auc_score(y_test, predictions_proba))
<class 'sklearn.ensemble._forest.RandomForestClassifier'>
[0 0 0 0 0]
Confusion Matrix 
 [[6712  331]
 [1248  709]]
ROC-AUC Score 
 0.7819188805230386
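The confusion matrix can be turned into a few more scores by hand. The sketch below uses the four counts printed above and assumes Scikit Learn's layout for binary labels, where rows are true classes and columns are predicted classes, giving [[TN, FP], [FN, TP]].

```python
# Counts copied from the confusion matrix printed above
tn, fp = 6712, 331
fn, tp = 1248, 709

total = tn + fp + fn + tp

# Share of all test rows classified correctly
accuracy = (tp + tn) / total
# Of the rows predicted as defaults, the share that really defaulted
precision = tp / (tp + fp)
# Of the real defaults, the share the model caught
recall = tp / (tp + fn)

print(total)
print(round(accuracy, 3), round(precision, 3), round(recall, 3))
```

Recall is much lower than accuracy here: at the default 0.5 threshold used by predict, the model misses most of the actual defaults. ROC AUC is computed from the probabilities instead, so it does not depend on that threshold.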